
Conversation


@zufayu zufayu commented Jan 28, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist


Copilot AI left a comment


Pull request overview

Adds an FP32-output (likely FP32-accumulation) path for the fused MoE g1u1 per-token int8 SiLU kernel on gfx942, wiring a new hsaco + CSV config into the existing heuristic kernel-selection pipeline and exposing it through the Python/C++ APIs.

Changes:

  • Add gfx942 FP32 per-token int8 g1u1 SiLU hsaco (.co) and its kernel list CSV.
  • Extend fmoe_g1u1_a16 (C++/HIP) to select an FP32 config map and launch with T_O=float when out is FP32.
  • Update Python asm_moe to optionally allocate moe_buf as FP32 and cast back to the original dtype after execution (roughly the pattern sketched after this list).
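
In rough terms, the allocate-then-cast pattern from the last bullet looks like the sketch below. This is a minimal, self-contained illustration only; the tensor shape, dtype, and flag values are made-up stand-ins, not the real asm_moe arguments.

    import torch

    # Hypothetical stand-ins for the real asm_moe inputs.
    token_num, model_dim = 32, 4096
    dtype = torch.bfloat16        # dtype the caller expects back
    enable_fp32 = True            # heuristic decided by the gating shown further below

    # Allocate the MoE output buffer in FP32 when the FP32 kernel path is taken.
    moebuf_dtype = torch.float32 if enable_fp32 else dtype
    moe_buf = torch.zeros(token_num, model_dim, dtype=moebuf_dtype)

    # ... the fused MoE kernel would write its results into moe_buf here ...

    # Cast back to the caller's original dtype after execution.
    if moe_buf.dtype != dtype:
        moe_buf = moe_buf.to(dtype)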

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

  • hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_vs_smf_silu_1tg_32x384.co: adds the gfx942 FP32 SiLU g1u1 per-token int8 assembled kernel blob.
  • hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_silu.csv: registers the new FP32 kernel in the asm config generation pipeline.
  • csrc/py_itfs_cu/asm_fmoe.cu: enables selecting and launching the FP32-output kernel variant for fmoe_g1u1_a16.
  • aiter/fused_moe_bf16_asm.py: adds a Python-side heuristic to allocate an FP32 output buffer and convert it back to the original dtype.


Comment on lines 80 to 90
    is_g1u1 = (
        w2.shape[2] * 2 * lastdim_mul == w1.shape[1] and fc2_smooth_scale is not None
    )
    enable_fp32 = (
        fc2_smooth_scale is not None
        and is_g1u1
        and (inter_dim % 384 == 0)
        and w1.dtype == dtypes.i8
        and a16
    )
    moebuf_dtype = torch.float32 if enable_fp32 else dtype

Copilot AI Jan 28, 2026


enable_fp32 can currently become True even when activation is GELU or when running on non-gfx942 GPUs. In those cases moe_buf is allocated as FP32 and routed into aiter.fmoe_g1u1_a16, but the C++ side only wires up FP32 output support for (int8 weights, Silu) and only has an FP32 hsaco listed for gfx942; this will reliably throw at runtime (unsupported activation/dtype or “No suitable kernel found”). Gate enable_fp32 at least on activation == ActivationType.Silu and the current arch (e.g. get_gfx() == "gfx942"), or add a safe fallback when the FP32 kernel isn’t available.
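
A sketch of the suggested gate, reusing the variable names from the diff hunk above (get_gfx() and ActivationType follow the review's wording; where exactly they are imported from on the Python side is an assumption):

    # Sketch only: variable names come from the diff hunk above; the import locations
    # of ActivationType and get_gfx() are assumptions.
    enable_fp32 = (
        fc2_smooth_scale is not None
        and is_g1u1
        and (inter_dim % 384 == 0)
        and w1.dtype == dtypes.i8
        and a16
        and activation == ActivationType.Silu  # FP32 hsaco only covers SiLU
        and get_gfx() == "gfx942"              # FP32 hsaco is only shipped for gfx942
    )
    moebuf_dtype = torch.float32 if enable_fp32 else dtype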

Comment on lines +774 to 776
    else if(out.dtype() == at::ScalarType::Float && activation == ActivationType::Silu)
        config_map = &cfg_fmoe_fp32_pertokenInt8_g1u1_silu;
    else

Copilot AI Jan 28, 2026


FP32 output is now selectable here, but the only shipped FP32 hsaco in this PR is for gfx942 and subGU_n=384 (see hsa/gfx942/fmoe/silu/fmoe_fp32_pertokenInt8_g1u1_silu.csv). If callers pass out as FP32 on other arches or with an inter_dim not divisible by 384, this will fail later with a generic “No suitable kernel found”. Consider adding an explicit early TORCH_CHECK documenting the FP32 constraints (arch + supported inter_dim tiles) so the failure mode is clearer.
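
A minimal sketch of such a check, placed before the config-map selection; the arch string and inter_dim variables are placeholders for however asm_fmoe.cu actually exposes them, not the real names:

    // Sketch only: `arch` and `inter_dim` stand in for however the current GPU arch
    // and intermediate dimension are obtained in asm_fmoe.cu.
    if(out.dtype() == at::ScalarType::Float)
    {
        TORCH_CHECK(activation == ActivationType::Silu,
                    "fmoe_g1u1_a16: FP32 output is only supported with SiLU activation");
        TORCH_CHECK(arch == "gfx942",
                    "fmoe_g1u1_a16: FP32 output kernels are only shipped for gfx942");
        TORCH_CHECK(inter_dim % 384 == 0,
                    "fmoe_g1u1_a16: FP32 output requires inter_dim to be a multiple of 384");
    }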

